The goal of this project was to predict prices for houses in King County, Washington. The county is unusually large, with nearly twice the land area of Rhode Island and the thirteenth-largest county population in the United States. It also contains a range of geographies and settings, including a metropolis (Seattle), several lakes, and sparser rural areas. Because of this variety, many variables ended up as features in the final predictive model.
The model was constructed from data on 17,384 houses sold in the county between 2014 and 2015. Their locations are plotted on the county map below:
The data used in this analysis were copied from https://github.com/STAT-ATA-ASU/STT3851Spring2016/blob/gh-pages/Data/housedata.csv on March 31, 2016.
Most of the variables are self-explanatory, but a glossary is available from the King County Department of Assessments.
The variables id and date were removed from the data. The variables waterfront, condition, grade, and zipcode were converted from numeric values to factors. For any house missing a yr_renovated value, yr_renovated was set to the corresponding yr_built; that is, houses that had never been renovated had their renovation dates set to the dates they were originally built. The month and day for both of these variables were set to September 1st. Finally, a new variable, lot_size, was introduced based on the realtor lot size categories reported in the paper Modeling Home Prices Using Realtor Data (Pardoe 2008).
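The cleaning steps above can be sketched in base R roughly as follows. This is an illustration on a made-up three-row data frame, not the original script: the sample values are invented, and the sketch assumes that a missing yr_renovated is coded as NA. The lot_size variable is left out because its category boundaries are not listed here.

```r
# Toy stand-in for the raw data; all values are invented for illustration.
houses <- data.frame(
  id           = c(1, 2, 3),
  date         = c("20141013T000000", "20150225T000000", "20140627T000000"),
  price        = c(221900, 538000, 180000),
  waterfront   = c(0, 0, 1),
  condition    = c(3, 3, 4),
  grade        = c(7, 7, 6),
  zipcode      = c(98178, 98125, 98028),
  yr_built     = c(1955, 1951, 1933),
  yr_renovated = c(NA, 1991, NA)   # assumption: "missing" is coded as NA
)

# Drop the identifier and sale-date columns.
houses <- houses[, !(names(houses) %in% c("id", "date"))]

# Treat numerically coded categories as factors.
for (v in c("waterfront", "condition", "grade", "zipcode")) {
  houses[[v]] <- factor(houses[[v]])
}

# Houses that were never renovated inherit their construction year.
not_renovated <- is.na(houses$yr_renovated)
houses$yr_renovated[not_renovated] <- houses$yr_built[not_renovated]

# Anchor both years to September 1st as Date values.
houses$yr_built     <- as.Date(paste0(houses$yr_built, "-09-01"))
houses$yr_renovated <- as.Date(paste0(houses$yr_renovated, "-09-01"))

str(houses)
```

If the raw file codes unrenovated houses differently (for example with a sentinel value), the `not_renovated` condition would need to change accordingly.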
The cleaned data including initial variables can be seen and explored in the following table:
Exploratory analysis was performed by examining plots and single-variable regressions for various features onto price. A few examples follow:
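As a rough sketch of what a batch of single-variable regressions onto price might look like, the following R code fits one simple linear model per feature and collects the R\(^2\) values. The data are simulated, so the numbers carry no meaning for the real housing data; only the pattern is of interest.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: price depends on sqft_living but not on bedrooms.
houses <- data.frame(
  sqft_living = runif(n, 500, 5000),
  bedrooms    = sample(1:6, n, replace = TRUE)
)
houses$price <- 50000 + 200 * houses$sqft_living + rnorm(n, sd = 50000)

# Fit price ~ feature for each feature and record the R-squared.
features <- c("sqft_living", "bedrooms")
r_squared <- sapply(features, function(f) {
  fit <- lm(reformulate(f, response = "price"), data = houses)
  summary(fit)$r.squared
})

print(round(r_squared, 3))
```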
At this point, the variables grade and condition were collapsed to account for limited observations and limited distinct effect in their lower categories. The levels 1 - 6 from grade were placed into one new lowest category 1, and the levels poor and fair from condition were placed into one new lowest category poor-fair. Essentially, any houses below average in grade or condition were grouped together.
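One way to collapse factor levels like this in base R is to assign the same new name to several old levels, which merges them. The sketch below uses small invented vectors; the condition labels other than poor and fair are assumptions made for the example.

```r
# Invented example values; the real data have many more observations.
grade <- factor(c(3, 5, 6, 7, 8, 10, 13))
condition <- factor(
  c("poor", "fair", "average", "good", "very good"),
  levels = c("poor", "fair", "average", "good", "very good")  # assumed labels
)

# Merge grades 1 through 6 into a single lowest level named "1".
levels(grade)[levels(grade) %in% as.character(1:6)] <- "1"

# Merge "poor" and "fair" into a single lowest level named "poor-fair".
levels(condition)[levels(condition) %in% c("poor", "fair")] <- "poor-fair"

print(table(grade))
print(table(condition))
```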
Standard multivariate regression techniques such as those described in James et al. (2013) were used to develop a predictive model.
All analyses performed in this paper can be reproduced by running the original .Rmd file with RStudio, assuming the original housing data file has been downloaded to the user’s working directory as “housingdata.csv”. The R packages car (Fox and Weisberg 2015), ggplot2 (Wickham and Chang 2016), knitr (Xie 2016a), rmarkdown (Allaire et al. 2016), MASS (Ripley 2015), DT (Xie 2015), boot (Canty and Ripley 2016), ggmap (Kahle and Wickham 2016) and bookdown (Xie 2016b) will need to be installed on the user’s computer.
Feature selection began with a stepwise search on AIC, using the stepAIC function from the MASS package on the cleaned data with all variables. The forward-selected model (modfs) suggested inclusion of all features except sqft_basement. It reported an adjusted R\(^2\) value of 0.8355252.
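A forward search with stepAIC can be sketched as below. The data are simulated and the variable names x1, x2, and x3 are placeholders, so this shows only the general call pattern and not the exact call used to produce modfs.

```r
library(MASS)

set.seed(42)
n <- 200

# Simulated data: y depends on x1 and x2; x3 is pure noise.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 3 * d$x1 - 2 * d$x2 + rnorm(n)

# Start from the intercept-only model and let stepAIC add terms
# one at a time for as long as doing so lowers AIC.
null_mod <- lm(y ~ 1, data = d)
modfs <- stepAIC(
  null_mod,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace     = FALSE
)

selected <- attr(terms(modfs), "term.labels")
print(selected)
print(summary(modfs)$adj.r.squared)
```

The `upper` formula plays the role of "all variables" in the cleaned data; any feature that never lowers AIC is simply left out of the returned model.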
The second model (mod2) included polynomial features as suggested by residual plots of the forward-selected model, as well as interaction features that might be expected to contribute to housing price, including bedrooms:bathrooms, sqft_living:sqft_lot, sqft_lot15:sqft_lot, and sqft_living15:sqft_living. Because the waterfront variable seemed to have such an outsized effect (see plot in Exploratory Analysis section), interaction features between waterfront and several other variables were also introduced. Additionally, as suggested by the recent Seattle Times article King County home prices hit new highs (Bhatt 2016), prices for smaller houses vary widely among different parts of the county, so the interactions zipcode:sqft_living and zipcode:sqft_lot were introduced. This model reported an adjusted R\(^2\) value of 0.8936747.
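To illustrate how polynomial and interaction features enter an R model formula, here is a small sketch on simulated data. The variable names echo those in the text, but the data-generating process, the choice of a quadratic term, and the particular interactions are assumptions made for the example; this is not the actual mod2 specification.

```r
set.seed(7)
n <- 300

# Simulated houses; all values and effect sizes are invented.
houses <- data.frame(
  sqft_living = runif(n, 500, 5000),
  bedrooms    = sample(1:5, n, replace = TRUE),
  bathrooms   = sample(1:3, n, replace = TRUE),
  waterfront  = factor(rbinom(n, 1, 0.2))
)
# Price has curvature in sqft_living and a steeper slope on the waterfront.
houses$price <- 1e5 + 100 * houses$sqft_living + 0.02 * houses$sqft_living^2 +
  (houses$waterfront == "1") * 150 * houses$sqft_living +
  rnorm(n, sd = 5e4)

# Main effects only.
mod1 <- lm(price ~ sqft_living + bedrooms + bathrooms + waterfront,
           data = houses)

# Add a squared term with I() and interactions with the ":" operator.
mod2 <- lm(price ~ sqft_living + I(sqft_living^2) + bedrooms + bathrooms +
             waterfront + bedrooms:bathrooms + waterfront:sqft_living,
           data = houses)

print(c(mod1 = summary(mod1)$adj.r.squared,
        mod2 = summary(mod2)$adj.r.squared))
```

An interaction with a factor that has many levels, such as zipcode, adds one slope coefficient per level beyond the baseline, so such terms can enlarge the model considerably.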
The final model (mod3) simplified the previous model by dropping several first-degree polynomial terms and interactions that were not statistically significant (p-value > 0.05). This model reported an adjusted R\(^2\) value of 0.8900775, only slightly below that of mod2.
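A minimal sketch of this kind of pruning is shown below, assuming all predictors are numeric so that coefficient names match term names. With factors or interactions a single term spans several coefficients, and a term-level tool such as drop1 would be a more natural fit. The data and variable names are simulated placeholders.

```r
set.seed(11)
n <- 200

# Simulated data: y depends on x1 and x2; x3 is unrelated noise.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + 3 * d$x1 - 2 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Per-coefficient p-values, excluding the intercept row.
p_values <- coef(summary(full))[-1, "Pr(>|t|)"]

# Keep only the terms significant at the 0.05 level and refit.
keep    <- names(p_values)[p_values <= 0.05]
reduced <- update(full, reformulate(keep, response = "y"))

print(round(p_values, 4))
print(formula(reduced))
```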
Five-fold cross-validation was performed on all three models with the cv.glm function from the boot package. mod3 consistently achieved the lowest mean squared test error. Results of ten repetitions of 5-fold cross-validation on all three models are plotted below:
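Repeated 5-fold cross-validation with cv.glm can be sketched as follows. The two models and the data here are simulated placeholders, and `cv_mse` is a hypothetical helper written for this example. Note that cv.glm expects models fitted with glm; a call to glm with the default Gaussian family gives the same fit as lm.

```r
library(boot)

set.seed(123)
n <- 200

# Simulated data with a clear linear signal.
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

# Two candidate models, fitted with glm() so that cv.glm() can refit them.
fit_null <- glm(y ~ 1, data = d)
fit_lin  <- glm(y ~ x, data = d)

# Hypothetical helper: repeat K-fold CV and return the raw CV estimate of
# mean squared prediction error (first element of delta) for each repetition.
cv_mse <- function(fit, data, reps = 10, K = 5) {
  replicate(reps, cv.glm(data, fit, K = K)$delta[1])
}

mse_null <- cv_mse(fit_null, d)
mse_lin  <- cv_mse(fit_lin, d)

print(c(null = mean(mse_null), linear = mean(mse_lin)))
```

Because the fold assignment is random, each repetition gives a slightly different error estimate, which is why repeating the procedure and comparing the spread across models is informative.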
The goal of the model was to predict housing prices, so a handful of price predictions have been computed using the final model and are reported in the table below. For each set of variable inputs, the table gives the mean predicted price (Estimate) and the lower (Lower) and upper (Upper) limits of a 95% confidence interval.
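Estimates and 95% confidence limits of this kind can be obtained from a fitted lm object with predict. The sketch below uses a simulated single-predictor model rather than the final model, and the input values are invented.

```r
set.seed(5)
n <- 100

# Simulated data; coefficients and noise level are invented.
d <- data.frame(sqft_living = runif(n, 500, 4000))
d$price <- 50000 + 250 * d$sqft_living + rnorm(n, sd = 40000)

fit <- lm(price ~ sqft_living, data = d)

# Hypothetical houses to price.
new_houses <- data.frame(sqft_living = c(1000, 2000, 3000))

# interval = "confidence" gives limits for the mean price at these inputs.
pred <- predict(fit, newdata = new_houses,
                interval = "confidence", level = 0.95)
colnames(pred) <- c("Estimate", "Lower", "Upper")

result <- cbind(new_houses, pred)
print(result)
```

A confidence interval describes uncertainty in the mean price for houses with the given inputs; passing `interval = "prediction"` instead would give the wider interval for an individual sale.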
Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, Aron Atkins, and Rob Hyndman. 2016. Rmarkdown: Dynamic Documents for R. http://CRAN.R-project.org/package=rmarkdown.
Bhatt, Sanjay. 2016. “King County Home Prices Hit New Highs, Inventory at New Lows.” The Seattle Times, January. http://www.seattletimes.com/business/real-estate/king-county-home-prices-hit-a-new-record-in-december/.
Canty, Angelo, and Brian Ripley. 2016. Boot: Bootstrap Functions (Originally by Angelo Canty for S). http://CRAN.R-project.org/package=boot.
Fox, John, and Sanford Weisberg. 2015. Car: Companion to Applied Regression. http://CRAN.R-project.org/package=car.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer.
Kahle, David, and Hadley Wickham. 2016. Ggmap: Spatial Visualization with Ggplot2. http://CRAN.R-project.org/package=ggmap.
Pardoe, Iain. 2008. “Modeling Home Prices Using Realtor Data.” Journal of Statistics Education 16 (2). http://www.amstat.org/publications/jse/v16n2/datasets.pardoe.html.
Ripley, Brian. 2015. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. http://CRAN.R-project.org/package=MASS.
Wickham, Hadley, and Winston Chang. 2016. Ggplot2: An Implementation of the Grammar of Graphics. http://CRAN.R-project.org/package=ggplot2.
Xie, Yihui. 2015. DT: A Wrapper of the JavaScript Library ’DataTables’. http://CRAN.R-project.org/package=DT.
———. 2016a. Knitr: A General-Purpose Package for Dynamic Report Generation in R. http://yihui.name/knitr/.
———. 2016b. Bookdown: Authoring Books with R Markdown. https://github.com/rstudio/bookdown.